SearCas 1 寻踪觅Depth

March 09, 2019

Overview

Recovering 3D depth information from 2D images is a fundamental problem in computer vision, and an essential step for scene understanding and 3D reconstruction. Across many fields such as 3D vision, VR, and AR, research on depth falls mainly into two routes, Binocular and Monocular:

  • Binocular Depth Estimation: early work focused on exploiting and studying the stereo and vergence cues. Today research centers on Stereo Vision, where various Stereo Matching algorithms compute and refine the disparity from a pair of stereo images.

    • In humans, stereoscopically presented images provide information about depth. Julesz showed that random dot stereograms provide a cue for disparity even when each image does not provide any high-level cue for depth.

    • Recent research in 3D vision employing active camera systems has demonstrated the integrated use of the binocular cues of stereo and vergence and the monocular cue of focus in estimating scene surfaces or obtaining range measurements.

    • Stereo matching aims to estimate correspondences of all pixels between two rectified images. It is a core problem for many stereo vision tasks.

    • Early works focused on depth estimation from stereo images by developing geometry-based algorithms that rely on point correspondences between images and triangulation to estimate the depth.

    • For pixel (x,y) in the reference image, if its corresponding disparity is d(x,y), then the depth of this pixel could be calculated by f∗B/d(x,y), where f is the camera’s focal length and B is the distance between two camera centers.

  • Monocular Depth Estimation: early work focused on depth estimation algorithms over multiple images, e.g., from motion, video sequences, or defocus. Despite the wide use of cues such as texture, gradients, color, and defocus, Monocular Depth Estimation is in fact an ill-posed problem due to an intrinsic ambiguity, so depth estimation from a single monocular image progressed slowly until the widespread adoption of DCNNs in recent years.

    • Compared to depth estimation from stereo images or video sequences, in which significant progress has been made, the progress of MDE is slow.

    • Some methods can estimate 3-d models from a single image, but they make strong assumptions about the scene and work in specific settings only.

    • Inferring the 3-d structure remains extremely challenging for current computer vision systems—there is an intrinsic ambiguity between local image features and the 3-d location of the point, due to perspective projection.

    • Humans use various monocular cues to infer the 3-d structure of the scene. Some of the cues are local properties of the image, such as texture variations and gradients, color, haze, defocus, etc. Local image cues alone are usually insufficient to infer the 3-d structure. The ability of humans to “integrate information” over space, i.e., understanding the relation between different parts of the image, is crucial to understanding the 3-d structure.

There is no clear argument for counting depth estimation from motion or video sequences as Monocular Depth Estimation.
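The depth-from-disparity relation quoted above, depth = f·B/d(x,y), takes only a few lines of numpy; the focal length and baseline below are made-up calibration values for illustration:

```python
import numpy as np

f = 721.5   # focal length in pixels (hypothetical calibration)
B = 0.54    # baseline between the two camera centers, in meters

# Toy disparity map in pixels; zero marks pixels with no valid match.
disparity = np.array([[36.0, 72.0],
                      [ 0.0,  9.0]])

# depth = f * B / d(x, y), applied only where the disparity is valid
depth = np.zeros_like(disparity)
valid = disparity > 0
depth[valid] = f * B / disparity[valid]
```

Note that depth is inversely proportional to disparity: halving the disparity doubles the estimated depth, which is why distant objects are the hardest to localize precisely.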

Details

Binocular Depth Estimation(Stereo)

Key Points

The main difficulties are Stereo Occlusions (see the figure below), object boundaries, blur, low or repetitive textures, and recording and illumination differences.

[figure: stereo occlusion]

Stereo Matching is commonly divided into four steps:

  1. matching cost calculation

  2. matching cost aggregation

  3. disparity calculation/optimization

  4. disparity refinement
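The four steps above can be sketched end-to-end with a toy block-matching pipeline; the SAD cost, box-filter aggregation, winner-takes-all selection, and median-filter refinement below are classic textbook choices, not any particular paper's method:

```python
import numpy as np

def stereo_match(left, right, max_disp=4, win=1):
    """Toy dense stereo matcher illustrating the four classic steps."""
    h, w = left.shape

    # 1. Matching cost computation: absolute intensity difference per
    #    pixel and candidate disparity (a "cost volume").
    cost = np.full((max_disp, h, w), 255.0)
    for d in range(max_disp):
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, :w - d])

    # 2. Cost aggregation: sum costs over a fixed-size window at
    #    constant disparity (a box filter via edge padding).
    pad = np.pad(cost, ((0, 0), (win, win), (win, win)), mode="edge")
    agg = np.zeros_like(cost)
    for dy in range(2 * win + 1):
        for dx in range(2 * win + 1):
            agg += pad[:, dy:dy + h, dx:dx + w]

    # 3. Disparity computation: winner-takes-all per pixel.
    disp = np.argmin(agg, axis=0)

    # 4. Disparity refinement: 3x3 median filter to remove isolated peaks.
    dpad = np.pad(disp, 1, mode="edge")
    refined = np.empty_like(disp)
    for y in range(h):
        for x in range(w):
            refined[y, x] = np.median(dpad[y:y + 3, x:x + 3])
    return refined

# Synthetic rectified pair: every left pixel x matches right pixel x - 2,
# i.e. a constant ground-truth disparity of 2 (a toy ramp image).
left = np.tile(np.arange(16, dtype=float), (8, 1))
right = left + 2.0
disp_map = stereo_match(left, right)
```

Away from the left border (where no valid match exists, one of the occlusion-like difficulties listed above) the recovered disparity is the constant 2.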

History

without CNNs

  • Matching cost computation is very often based on the absolute, squared, or sampling insensitive difference of intensities or colors. Since these costs are sensitive to radiometric differences, costs based on image gradients are also used. Mutual Information (MI) has been adapted for stereo matching and approximated for faster computation.
  • Cost aggregation connects the matching costs within a certain neighborhood. Often, costs are simply summed over a fixed sized window at constant disparity. Some methods additionally weight each pixel within the window according to color similarity and proximity to the center pixel. Another possibility is to select the neighborhood according to segments of constant intensity or color.
  • Disparity computation is done for local algorithms by selecting the disparity with the lowest matching cost, that is, winner takes all. Global algorithms typically skip the cost aggregation step and define a global energy function that includes a data term and a smoothness term. The former sums pixelwise matching costs, whereas the latter supports piecewise smooth disparity selection.
  • Disparity refinement is often done for removing peaks, checking the consistency, interpolating gaps, or increasing the accuracy by subpixel interpolation.

reorganized from Stereo Processing by Semiglobal Matching and Mutual Information
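The global energy that the disparity-computation bullet describes, a pixelwise data term plus a pairwise smoothness term, can be written down directly; the function name and the simple L1 smoothness penalty are illustrative choices:

```python
import numpy as np

def global_energy(disp, cost_volume, lam=1.0):
    """E(D) = sum_p C(p, d_p) + lam * sum over 4-neighbors of |d_p - d_q|.

    disp        : (H, W) integer disparity map D
    cost_volume : (D_max, H, W) pixelwise matching costs C
    lam         : weight of the smoothness term
    """
    ys, xs = np.indices(disp.shape)
    data = cost_volume[disp, ys, xs].sum()            # data term
    smooth = (np.abs(np.diff(disp, axis=0)).sum()     # vertical neighbors
              + np.abs(np.diff(disp, axis=1)).sum())  # horizontal neighbors
    return data + lam * smooth

# A flat map pays no smoothness penalty; a noisy one is penalized even
# though each pixel's data cost may look reasonable on its own.
cost = np.zeros((2, 2, 2))
cost[1] = 1.0                       # disparity 1 costs 1 per pixel
flat = np.zeros((2, 2), dtype=int)
noisy = np.array([[0, 1], [1, 0]])
```

Global algorithms then search for the disparity map minimizing this energy (e.g., via graph cuts or belief propagation) instead of aggregating costs locally.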


with CNNs

  • Zbontar and LeCun first introduced CNN to calculate the matching cost to measure the similarity of two pixels of two images. This method achieved the best performance on the KITTI 2012, KITTI 2015 and Middlebury stereo datasets at that time. Following the work, several methods were proposed to improve the computational efficiency or matching accuracy.
  • The matching cost calculation, matching cost aggregation, and disparity calculation steps can be seamlessly integrated into a CNN to directly estimate the disparity from stereo images.
  • If all steps are integrated into a whole network for joint optimization, better disparity estimation performance can be expected. However, it is non-trivial to integrate the disparity refinement step with the other three steps. Existing methods used additional networks for disparity refinement.

extracted from Learning for Disparity Estimation through Feature Constancy

Papers

This section lists some of the current best-performing models:

Learning Depth with Convolutional Spatial Propagation Network

Depth prediction is one of the fundamental problems in computer vision. In this paper, we propose a simple yet effective convolutional spatial propagation network (CSPN) to learn the affinity matrix for various depth estimation tasks. Specifically, it is an efficient linear propagation model in which the propagation is performed as a recurrent convolutional operation, and the affinity among neighboring pixels is learned through a deep convolutional neural network (CNN). This module can be appended to any output of a state-of-the-art (SOTA) depth estimation network to improve its performance. In practice, we further extend CSPN in two aspects: 1) it can take a sparse depth map as an additional input, which is useful for the depth completion task; 2) it can be extended to a 3D CSPN to handle features with one additional dimension, similar to the commonly used 3D convolution, which is effective in stereo matching tasks using 3D cost volumes. For the tasks of single-image depth estimation and depth completion, we experiment with the proposed CSPN on the popular NYU v2 and KITTI datasets, where we show that our algorithms not only produce higher-quality results (e.g., roughly 30% further reduction in depth error) but also run faster (e.g., 2 to 5 times faster) than previous SOTA spatial propagation networks. We also evaluate our stereo matching algorithm on the Scene Flow [8] and KITTI Stereo datasets, ranking 1st on both the KITTI Stereo 2012 and 2015 benchmarks, which demonstrates the effectiveness of the proposed module.

[figure: CSPN architecture]

insights: pass

comments: pass
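The linear propagation model the abstract describes can be sketched in numpy; one recurrent step mixes each pixel with its 8 neighbors using learned affinities and re-injects the original estimate. The function name, shapes, and normalization details are my assumptions for illustration, not the authors' code:

```python
import numpy as np

def cspn_step(h, h0, kappa):
    """One step of CSPN-style linear propagation on an (H, W) map.

    h     : current refined map at step t
    h0    : original network output, re-injected at every step
    kappa : (8, H, W) learned affinities toward the 8 neighbors
    """
    H, W = h.shape
    # Normalize so the absolute neighbor weights sum to 1 per pixel;
    # the leftover weight (1 - sum) is kept for the original estimate.
    k = kappa / (np.abs(kappa).sum(axis=0) + 1e-8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    pad = np.pad(h, 1, mode="edge")
    out = np.zeros_like(h)
    for i, (dy, dx) in enumerate(offsets):
        out += k[i] * pad[1 + dy:1 + dy + H, 1 + dx:1 + dx + W]
    out += (1.0 - k.sum(axis=0)) * h0
    return out

# With zero affinities the map stays at the network output; with uniform
# positive affinities one step becomes an 8-neighbor average.
h0 = np.random.default_rng(0).uniform(size=(4, 4))
unchanged = cspn_step(h0, h0, np.zeros((8, 4, 4)))
```

Because each step is just convolution-style gathering, the recurrence runs efficiently on a GPU, which is where the paper's speedup over earlier spatial propagation networks comes from.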

Learning for Disparity Estimation through Feature Constancy

CVPR 2018 and ROB 2018

Stereo matching algorithms usually consist of four steps: matching cost calculation, matching cost aggregation, disparity calculation, and disparity refinement. Existing CNN-based methods adopt CNNs only for some of the four steps, or use different networks to handle different steps, making it difficult to obtain an overall optimal solution. In this paper, we propose a network architecture that incorporates all the steps of stereo matching. The network consists of three parts. The first part computes multi-scale shared features. The second part performs matching cost calculation, matching cost aggregation, and disparity calculation to estimate an initial disparity from the shared features. The initial disparity and the shared features are then used to compute a feature constancy that measures the correctness of the correspondence between the two input images. The initial disparity and the feature constancy are then fed into a sub-network to refine the initial disparity. The proposed method has been evaluated on the Scene Flow and KITTI datasets. It achieves state-of-the-art performance on the KITTI 2012 and KITTI 2015 benchmarks while maintaining a very fast running time.

[figure: iResNet architecture]

insights: pass

comments: pass
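The core idea of feature constancy, warping the right view to the left view with the estimated disparity and measuring how well the two agree, can be sketched in numpy. This works on raw intensities instead of learned CNN features, and the function names and simple L1 error are illustrative, not the paper's network:

```python
import numpy as np

def warp_to_left(right, disp):
    """Sample the right view at x - d(x): out[y, x] = right[y, x - d]."""
    H, W = right.shape
    xs = np.arange(W)[None, :] - disp            # subpixel sample positions
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    frac = np.clip(xs - x0, 0.0, 1.0)
    rows = np.arange(H)[:, None]
    # linear interpolation along each scanline
    return (1 - frac) * right[rows, x0] + frac * right[rows, x0 + 1]

def constancy_error(left, right, disp):
    """Per-pixel reconstruction error: low where the disparity is correct."""
    return np.abs(left - warp_to_left(right, disp))

# With the correct constant disparity of 2, the error vanishes away from
# the left border (where the warp has no valid source pixels).
left = np.tile(np.arange(16, dtype=float), (8, 1))
right = left + 2.0
err = constancy_error(left, right, np.full((8, 16), 2.0))
```

Regions with a large residual error indicate where the initial disparity is wrong, which is exactly the signal the refinement sub-network consumes.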

Monocular Depth Estimation(Single image)

Key Points

The main difficulty of MDE: its inherent ambiguity.

  • A single 2D image can represent countless different 3D models, but only a few of them are valid:

    MDE is an ill-posed problem: a single 2D image may be produced from an infinite number of distinct 3D scenes.

  • Local image cues alone are usually insufficient to infer the 3D structure. The human ability to “integrate information” over space, i.e., to understand the relations between different parts of an image, is crucial to understanding its 3D structure.

    There is an intrinsic ambiguity between local image features and the 3-d location of the point, due to perspective projection.

    Local image cues alone are usually insufficient to infer the 3-d structure. The ability of humans to “integrate information” over space, i.e., understanding the relation between different parts of the image, is crucial to understanding the 3-d structure.

History

  • Saxena et al. learned the depth from monocular cues in 2D images via supervised learning.


  • Since then, a variety of approaches have been proposed to exploit the monocular cues using handcrafted representations.

  • Since handcrafted features alone can only capture local information, probabilistic graphical models such as Markov Random Fields (MRFs) are often built on top of these features to incorporate long-range and global cues.

  • Another successful way to make use of global cues is the DepthTransfer method which uses GIST global scene features to search for candidate images that are “similar” to the input image from a database containing RGBD images.


  • With the use of DCNN-based models, it has been demonstrated that deep features are superior to handcrafted features. Thanks to the multi-level contextual and structural information from powerful very deep networks (e.g., VGG and ResNet), depth estimation has been boosted to a new accuracy level.

  • Later works remove the repeated spatial pooling operations and obtain high-resolution depth maps by incorporating higher-resolution feature maps via multi-layer deconvolutional networks, multi-scale networks, or skip connections.

  • To improve efficiency, the Neural Regression Forest method, which allows for parallelizable training of “shallow” CNNs, was proposed.

  • Recently, unsupervised or semi-supervised learning has been introduced to learn depth estimation networks.

reorganized from Deep Ordinal Regression Network for Monocular Depth Estimation

Papers

This section lists some of the current best-performing models:

Deep Ordinal Regression Network for Monocular Depth Estimation

CVPR 2018

Monocular depth estimation, which plays a crucial role in understanding 3D scene geometry, is an ill-posed problem. Recent methods have gained significant improvement by exploring image-level information and hierarchical features from deep convolutional neural networks (DCNNs). These methods model depth estimation as a regression problem and train the regression networks by minimizing mean squared error, which suffers from slow convergence and unsatisfactory local solutions. In addition, existing depth estimation networks employ repeated spatial pooling operations, resulting in undesirably low-resolution feature maps. To obtain high-resolution depth maps, skip connections or multi-layer deconvolution networks are required, which complicates network training and consumes more computation. To eliminate or at least largely reduce these problems, we introduce a spacing-increasing discretization (SID) strategy to discretize depth and recast depth network learning as an ordinal regression problem. By training the network with an ordinal regression loss, our method achieves higher accuracy and faster convergence. Furthermore, we adopt a multi-scale network structure that avoids unnecessary spatial pooling and captures multi-scale information in parallel. The proposed Deep Ordinal Regression Network (DORN) achieves state-of-the-art results on three challenging benchmarks, i.e., KITTI, Make3D, and NYU Depth v2, and outperforms existing methods by a large margin.

[figure: DORN architecture]

insight:

  • instead of downsampling operations, dilated convolutions are used

  • The full-image encoder (which contains fewer parameters) captures global contextual information and can greatly clarify local confusions in depth estimation

  • perform the discretization using the SID strategy, which uniformly discretizes a given depth interval in log space to down-weight the training losses in regions with large depth values, so that the depth estimation network is able to predict relatively small and medium depths more accurately and to estimate large depth values rationally.

  • cast the depth estimation problem as an ordinal regression problem and develop an ordinal loss to learn our network parameters.

  • Simple network architecture, results that crush everything else:

    won the 1st prize in Robust Vision Challenge 2018. We ranked 1st place on both KITTI and ScanNet. Slides can be downloaded here.

comment:

  • I had previously overlooked the problem of depth scale and simply applied linear scaling:

    To quantize a depth interval [α, β] into a set of representative discrete values, a common way is the uniform discretization (UD). However, as the depth value becomes larger, the information for depth estimation is less rich, meaning that the estimation error of larger depth values is generally larger. Hence, using the UD strategy would induce an over-strengthened loss for the large depth values.

  • What makes this paper feel padded to me is that the same passage appears three times in different places, and is then split apart and written once more in the related works section.

  • Code: caffe, pytorch
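The UD-versus-SID contrast quoted above is easy to make concrete: UD splits the depth interval evenly in linear space, while SID splits it evenly in log space, so the bins widen with depth. A minimal sketch (the interval [1, 80] m and K = 8 bins are arbitrary illustration values):

```python
import numpy as np

def ud_thresholds(alpha, beta, K):
    """Uniform discretization: K equal-width bins in linear depth."""
    return alpha + (beta - alpha) * np.arange(K + 1) / K

def sid_thresholds(alpha, beta, K):
    """Spacing-increasing discretization: K equal-width bins in log depth,
    t_i = exp(log(alpha) + i/K * log(beta/alpha)), so bins widen as the
    depth (and its estimation uncertainty) grows."""
    i = np.arange(K + 1)
    return np.exp(np.log(alpha) + np.log(beta / alpha) * i / K)

ud = ud_thresholds(1.0, 80.0, 8)
sid = sid_thresholds(1.0, 80.0, 8)
# UD bins all span (80 - 1) / 8 meters; SID bins grow geometrically, so
# errors at large depths fall into coarser, more forgiving bins.
```

This is exactly the down-weighting of large-depth losses the quote describes: with SID, mispredicting a far-away pixel by a few meters can still land in the correct ordinal bin.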

Deep attention-based classification network for robust depth prediction

CVPR 2018

In this paper, we present a deep attention-based classification (DABC) network for robust single-image depth prediction, in the context of the Robust Vision Challenge 2018 (ROB 2018). Unlike conventional depth prediction, our goal is to design a model that performs well in both indoor and outdoor scenes with a single set of parameters. However, robust depth prediction faces two challenging problems: a) how to extract more discriminative features for diverse scenes (compared to a single scene)? b) how to handle the large difference in depth ranges between indoor and outdoor datasets? To address these two problems, we first formulate depth prediction as a multi-class classification task and apply a softmax classifier to classify the depth label of each pixel. We then introduce a global pooling layer and a channel-wise attention mechanism to adaptively select the discriminative channels of features and to update the original features by assigning higher weights to important channels. Furthermore, to reduce the influence of quantization error, we employ a soft-weighted-sum inference strategy for the final prediction. Experimental results on both indoor and outdoor datasets demonstrate the effectiveness of our method. It is worth mentioning that we won the 2nd place in the single-image depth prediction entry of ROB 2018, held in conjunction with IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018.

[figure: DABC architecture]

insight:

  • ResNet & SENet & U-Net combined.

  • Depth discretization is the same as DORN's SID:

    to tolerate the significant difference of depth ranges, we formulate depth prediction as a multi-class classification problem. By discretizing continuous depth values into several intervals, we choose a softmax layer as the predictor. In this way, each neuron in the last layer only needs to activate for a specific depth interval rather than to predict the depth value in the whole depth range, which makes the model easy to train.

  • The authors believe that the channel-wise attention mechanism plays a vital role in choosing discriminative features for diverse scenes and improving the performance in robust depth prediction.
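The soft-weighted-sum inference mentioned in the abstract can be sketched directly: instead of taking the argmax depth class, average the bin centers weighted by the softmax probabilities. The bin centers and logits below are made-up illustration values:

```python
import numpy as np

def soft_weighted_sum(logits, bin_centers):
    """Average the depth-bin centers weighted by softmax probabilities,
    so the prediction can fall between bins and quantization error shrinks."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=-1, keepdims=True)
    return probs @ bin_centers

centers = np.array([1.0, 2.0, 4.0, 8.0])   # hypothetical bin centers (meters)
logits = np.array([0.0, 5.0, 5.0, 0.0])    # mass split between the 2 m / 4 m bins
depth = soft_weighted_sum(logits, centers)  # lands near 3 m, between the bins
```

An argmax predictor could only answer 2 m or 4 m here; the soft-weighted sum interpolates between them, which is the point of the strategy.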

comment:

  • A typical ResNet & SENet & U-Net combination; a thin model.

  • The results are at best on par with DORN.

  • Proof that there are still plenty of fields ripe for low-effort models to flood, as long as you are fast enough.

Sources